How do you define a demarcation between non-programmable graphics chips and programmable ones? We have seen that, even in the humble TNT days, there were a couple of user-defined opcodes with several possible input values.
One way is to consider what programming is. Programming is not simply a mathematical operation; programming needs conditional logic. Therefore, it is not unreasonable to say that something is not truly programmable until there is the possibility of some form of conditional logic.
And it is at this point that conditional logic first truly appears. It appears first in the vertex pipeline rather than the fragment pipeline. This seems odd until one realizes how crucial fragment operations are to overall performance. It therefore makes sense to introduce heavy programmability in the less performance-critical areas of hardware first.
The GeForce 3, released in 2001 (a mere 3 years after the TNT), was the first hardware to provide this level of programmability. While GeForce 3 hardware did indeed have the fixed-function vertex pipeline, it also had a very flexible programmable vertex pipeline. Retaining the fixed functionality was a performance necessity; the programmable vertex shader was not as fast as the fixed-function pipeline. It should be noted that the original Xbox's GPU, designed in tandem with the GeForce 3, eschewed the fixed functionality altogether in favor of having multiple vertex shaders that could compute several vertices at a time. This approach was eventually adopted for later GeForces.
Vertex shaders were pretty powerful, even in their first incarnation. While there was no conditional branching, there was conditional logic, the equivalent of the ?: operator. These vertex shaders exposed up to 128 vec4 uniforms and up to 16 vec4 inputs (still the modern limit), and they could write 6 vec4 outputs. Two of the outputs, intended for colors, were lower precision than the others. There was a hard limit of 128 opcodes. These vertex shaders brought full swizzling support and a plethora of math operations.
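To make that ?: equivalent concrete, here is a minimal C++ sketch of branch-free conditional selection, roughly the pattern that a comparison opcode (such as SLT, "set on less than") plus a multiply-add made possible. The function names are illustrative, not part of any real shader instruction set.

    #include <cstdio>

    // Branch-free conditional selection: a comparison produces 0.0 or 1.0,
    // and that mask blends between the two candidate values.
    float SelectLessThan(float a, float b, float onTrue, float onFalse)
    {
        float mask = (a < b) ? 1.0f : 0.0f; // what an SLT-style opcode computes
        // mask * onTrue + (1 - mask) * onFalse: multiply-adds, no branch.
        return mask * onTrue + (1.0f - mask) * onFalse;
    }

    int main()
    {
        // Both "branches" are always evaluated; the mask merely picks a result.
        std::printf("%f\n", SelectLessThan(1.0f, 2.0f, 10.0f, 20.0f)); // 10.0
        std::printf("%f\n", SelectLessThan(3.0f, 2.0f, 10.0f, 20.0f)); // 20.0
    }

Note that both potential results are always computed; the comparison only decides which one survives. This is why it counts as conditional logic but not conditional branching.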
The GeForce 3 also added up to two more textures, for a total of four textures per triangle. They were hooked directly into certain per-vertex outputs, because the per-fragment pipeline did not have real programmability yet.
At this point, the holy grail of programmability at the fragment level was dependent texture access. That is, being able to access a texture, do some arbitrary computations on it, and then access another texture with the result. The GeForce 3 had some facilities for that, but they were not very good ones.
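As a concept, dependent texture access is easy to express in code. The following C++ sketch uses purely illustrative types and samplers (none of them name a real API) to show the data flow:

    #include <functional>

    struct Vec2 { float x, y; };
    struct Vec4 { float x, y, z, w; };

    // A "texture" is modeled as any callable mapping a coordinate to a color.
    using Sampler = std::function<Vec4(Vec2)>;

    // Dependent texture access: the coordinate of the second fetch is
    // computed from the result of the first fetch.
    Vec4 DependentFetch(const Sampler &texA, const Sampler &texB, Vec2 uv)
    {
        Vec4 first = texA(uv);               // fetch #1 at the incoming coordinate
        Vec2 derived = { first.x, first.y }; // arbitrary computation on the result
        return texB(derived);                // fetch #2 at the derived coordinate
    }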
The GeForce 3 used 8 register combiner stages instead of the 2 that the earlier cards used. Their register files were extended to support two extra texture colors and a few more tricks. But the main change was something that, in OpenGL terminology, would be called “texture shaders.”
What texture shaders did was allow the user to, instead of accessing a texture, perform a computation on that texture's texture unit. This was much like the old texture environment functionality, except only for texture coordinates. The texture units were arranged in a sequence, and rather than fetching from its texture, a unit could perform a computation between its own texture coordinate and, if there was one, the coordinate produced by the previous texture shader operation.
It was not very flexible functionality. It did allow for full texture-space bump mapping, though. While the 8 register combiners were enough to do a full matrix multiply, they were not powerful enough to normalize the resulting vector. However, you could normalize a vector by accessing a special cubemap. The values of this cubemap represented a normalized vector in the direction of the cubemap's given texture coordinate.
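To illustrate, here is a minimal C++ sketch of how one face of such a normalization cubemap could be generated. The [-1, 1] to [0, 255] packing and the +X face orientation follow common OpenGL conventions, but the function itself is hypothetical.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Builds the +X face of a normalization cubemap as RGB8 texels. Each
    // texel stores the unit vector pointing toward it, packed into [0, 255].
    std::vector<std::uint8_t> BuildPositiveXFace(int size)
    {
        std::vector<std::uint8_t> texels(size * size * 3);
        for (int t = 0; t < size; ++t)
        {
            for (int s = 0; s < size; ++s)
            {
                // Map the texel center onto [-1, 1] across the face.
                float sc = 2.0f * (s + 0.5f) / size - 1.0f;
                float tc = 2.0f * (t + 0.5f) / size - 1.0f;
                // Direction through this texel on the +X face, then normalize.
                float x = 1.0f, y = -tc, z = -sc;
                float invLen = 1.0f / std::sqrt(x * x + y * y + z * z);
                x *= invLen; y *= invLen; z *= invLen;
                // Pack [-1, 1] into [0, 255]; a fetch then yields a unit vector.
                std::size_t base = 3 * std::size_t(t * size + s);
                texels[base + 0] = std::uint8_t((x * 0.5f + 0.5f) * 255.0f);
                texels[base + 1] = std::uint8_t((y * 0.5f + 0.5f) * 255.0f);
                texels[base + 2] = std::uint8_t((z * 0.5f + 0.5f) * 255.0f);
            }
        }
        return texels;
    }

Sampling this cubemap with an unnormalized vector as the texture coordinate returns (a quantized version of) that vector normalized, trading arithmetic for a texture fetch.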
But using that cubemap required spending a total of 3 texture shader stages, which meant you got a bump map and a normalization cubemap only; there was no room for a diffuse map in that pass. It also did not perform very well; the texture shader functions were quite expensive.
True programmability came to the fragment shader from ATI, with the Radeon 8500, released in late 2001.
The 8500's fragment shader architecture was pretty straightforward, and in terms of programming, it was not too dissimilar from modern shader systems. Texture coordinates would come in. They could either be used to fetch from a texture or be given directly as inputs to the processing stage. Up to 6 textures could be used at once. Then, up to 8 opcodes, including a conditional operation, could be used. After that, the hardware would repeat the process using registers written by the opcodes: those registers could feed texture accesses from the same group of textures used in the first pass, and then another 8 opcodes would generate the output color.
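Expressed as code, that two-phase flow might look like the following schematic C++ sketch. Every type and helper here is a hypothetical stand-in that mirrors the structure just described; it is not ATI's actual instruction set.

    #include <array>

    struct Vec4 { float x, y, z, w; };

    // Stand-in for a fetch from one of the six texture units.
    static Vec4 SampleTexture(int /*unit*/, const Vec4 &coord)
    {
        return coord; // a real fetch would return the filtered texel
    }

    // Stand-in for a pass of up to 8 opcodes (including a conditional).
    static Vec4 RunEightOpcodes(const std::array<Vec4, 6> &regs)
    {
        return regs[0]; // real hardware would combine the registers here
    }

    static Vec4 ShadeFragment(const std::array<Vec4, 6> &texCoords)
    {
        // Phase 1: each coordinate either fetches from its texture or is
        // passed through directly as a numeric input.
        std::array<Vec4, 6> regs;
        for (int i = 0; i < 6; ++i)
            regs[i] = SampleTexture(i, texCoords[i]); // or: regs[i] = texCoords[i];

        // Up to 8 opcodes write new register values.
        Vec4 derived = RunEightOpcodes(regs);

        // Phase 2: the written registers feed fresh fetches from the same
        // six textures; this is the dependent texture access.
        std::array<Vec4, 6> regs2;
        for (int i = 0; i < 6; ++i)
            regs2[i] = SampleTexture(i, derived);

        // Another 8 opcodes produce the final output color.
        return RunEightOpcodes(regs2);
    }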
It also had strong, but not full, swizzling support in the fragment shader. The GeForce 3's register combiners, by contrast, had very little support for swizzling.
This era of hardware was also the first to allow 3D textures, though that was as much a memory concern as anything else, since 3D textures take up lots of memory, which was not available on earlier cards. Depth comparison texturing was also made available.
While the 8500 was a technological marvel, it was a flop in the market compared to the GeForce 3 and 4. Indeed, this is a recurring theme of these eras: the card with the more programmable hardware often tends to lose in its first iteration.